feat: Non-record 11L PR940 Stack (no n-gram in use) + 20k Steps + Legal TTT (1.0929 BPB) #1232
- Scaling study: PR940 architecture at 20k steps achieves 1.0929 BPB (legal TTT)
- Improves on prior GEPA 20k (1.0983 BPB) by −0.0054 BPB
- FlowRefiner variant at 1.0928 BPB confirms the auxiliary flow head is neutral
- Trained on 1x A100-40GB, ~10.7 h per run
- Artifact: 14,473,337 bytes (base), 14,635,871 bytes (flow)
- Includes scaling trajectory 7k–20k steps and comparison table
Community Review — feat: Non-record 11L PR940 Stack (no n-gram in use) + 20k Steps + Legal TTT (1.0929 BPB)

BPB: 1.0929 | Compliance: FLAG — hashed n-gram cache with target-in-key (PR #779 family pattern)

What I found in the code (at the head SHA): the n-gram lookup key at line 1475 is constructed by XOR-ing the target token into the hash. This matches the illegal pattern described in Issue #1017, condition 1.

Cluster context: this same structural pattern has been closed on 15+ PRs under the #779 ruling as of 2026-04-11 (#779 itself, #770, #798, #808, #825, #786, #797, #909, #940, #761, #776, #788, #774, #778, #715, #758, #702 upstream, #1488). The base neural model is unaffected by this flag — in every case where the authors resubmitted without the n-gram cache, the base val_bpb has been in the ~1.10–1.15 range (standard for the SP1024 11L class).

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.11s, dim=512, layers=11, vocab=1024, code=115032 B, SMOKE_TEST_PASS.

Verdict: COMPLIANCE FLAG — target-in-key hashed n-gram cache, same family as PR #779. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as the rest of the family-bug cluster. A context-only resubmission (drop the target from the lookup key and use a full-vocabulary reweighting from a single context row, per @valerio-oai's suggested legal path on #779) would be welcomed.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based analysis.
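To make the flagged pattern concrete, here is a minimal sketch of the two key constructions the review contrasts. The function names and hash constants are illustrative, not the PR's actual code; only the structural difference (target token mixed into the lookup key vs. a context-only key) is taken from the review.

```python
def ngram_key_target_in_key(context, target):
    # FLAGGED pattern: the target token is XOR-ed into the lookup key,
    # so the cache can memorize the very label it is asked to predict.
    h = 0
    for tok in context:
        h = (h * 1000003 + tok) & 0xFFFFFFFF
    return h ^ target  # target-in-key: leaks the label

def ngram_key_context_only(context):
    # LEGAL pattern: the key depends on the context alone; the cached
    # row would hold a full-vocabulary weight vector used to reweight
    # the model's logits, never the answer itself.
    h = 0
    for tok in context:
        h = (h * 1000003 + tok) & 0xFFFFFFFF
    return h

context = [5, 17, 3]
# Different targets produce different keys under the flagged scheme,
# which is exactly how the cache ends up encoding the answer.
assert ngram_key_target_in_key(context, 42) != ngram_key_target_in_key(context, 7)
assert ngram_key_context_only(context) == ngram_key_context_only(context)
```

The context-only variant is the shape of the legal resubmission path suggested on #779.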
val_bpb = 1.0929 (base) / 1.0928 (flow) | Pre-TTT: 1.1005 / 1.1000 | Artifact: 14.47 MB / 14.64 MB
Headline Result
Extending the PR #940 architecture stack to 20,000 steps (8,000 peak-LR + 12,000 warmdown) achieves 1.0929 BPB with legal score-first TTT — improving on our prior GEPA 20k submission (1.0983 BPB) by −0.0054 BPB. This improvement comes entirely from architectural upgrades (gated attention, value residual, all-layer XSA, LeakyReLU²) introduced in the PR #549→PR #940 evolution, applied at the same 20k training scale.
Two configurations were trained:
FlowRefiner adds 98,625 parameters and provides negligible benefit at 20k steps (−0.0005 BPB no-TTT, −0.0001 BPB with TTT) — the auxiliary flow head is essentially neutral at this training budget.
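The 20k schedule above (8,000 steps at peak LR followed by a 12,000-step warmdown) can be sketched as follows. The split is from the submission; the linear warmdown shape and the normalized peak LR of 1.0 are assumptions, as the exact schedule lives in `train_gpt_pr940.py`.

```python
def lr_schedule(step, peak_lr=1.0, peak_steps=8_000, total_steps=20_000):
    """Constant peak LR for the first 8k steps, then a linear warmdown
    to zero at step 20k (shape assumed; split per the submission)."""
    if step < peak_steps:
        return peak_lr
    frac = (step - peak_steps) / (total_steps - peak_steps)
    return peak_lr * (1.0 - frac)

assert lr_schedule(0) == 1.0        # peak phase
assert lr_schedule(7_999) == 1.0    # still at peak
assert lr_schedule(14_000) == 0.5   # halfway through warmdown
assert lr_schedule(20_000) == 0.0   # fully decayed
```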
Comparison with Prior 20k Submission
The prior GEPA 20k submission achieved a larger TTT gain (−0.017 vs −0.008 BPB) because its weaker float base left more room for test-time adaptation. The PR940 stack's stronger float base (1.1062 vs 1.1153) means TTT has less to correct — but the net result is still 0.005 BPB better.
Note: The new submission produces a smaller artifact despite using weaker compression (zstd-16 vs zstd-22). This is due to the PR940 architecture producing better-conditioned weight matrices that compress more efficiently.
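The note's claim is that more structured (better-conditioned) weight bytes compress smaller even at a lower compression level. The submission uses zstd, which is a third-party library; as a stand-in analogy, the same effect can be shown with the standard-library `zlib`, where structured bytes beat high-entropy bytes at the same level:

```python
import random
import zlib

random.seed(0)
# High-entropy bytes: an analogue of poorly-conditioned weights.
noisy = bytes(random.randrange(256) for _ in range(1 << 16))
# Low-entropy, highly structured bytes: an analogue of weights with
# more exploitable regularity.
structured = bytes(i % 32 for i in range(1 << 16))

small = len(zlib.compress(structured, 9))
big = len(zlib.compress(noisy, 9))
assert small < big  # structure compresses better at the same level
```

This is only an illustration of the mechanism, not a measurement of the actual checkpoints.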
Scaling Study: 7k → 20k Steps
Training trajectory showing that the warmdown phase (steps 8,000–20,000) is the primary driver of improvement:
Key observations:
Quantized Evaluation Summary
Architecture Summary
- Scaling factor 1/√(layer+1)
- FlowRefiner (supplementary config only)
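Two of the components above can be sketched in a few lines. The exact LeakyReLU² formulation is in `train_gpt_pr940.py`; the sign-preserving-square form below is an assumption, as is attaching the 1/√(layer+1) factor to a per-layer residual scale.

```python
import math

def leaky_relu_sq(x, alpha=0.01):
    """One plausible LeakyReLU² form (assumption): apply a LeakyReLU,
    then square while preserving sign, so negatives stay (slightly)
    negative instead of being zeroed as in plain ReLU²."""
    y = x if x > 0 else alpha * x
    return y * abs(y)

def residual_scale(layer):
    # Per-layer scaling factor 1/sqrt(layer+1): deeper layers
    # contribute progressively less to the residual stream.
    return 1.0 / math.sqrt(layer + 1)

assert leaky_relu_sq(2.0) == 4.0      # positive branch: plain square
assert leaky_relu_sq(-2.0) < 0.0      # negative branch: small, sign kept
assert abs(residual_scale(3) - 0.5) < 1e-12  # 1/sqrt(4)
```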
Training Details
Quantization Details
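The submission's mixed-quant scheme (PR #76) is not reproduced here; as background, a minimal symmetric per-tensor int8 round-trip looks like the following. All names and the specific scheme are illustrative, and the actual PR #76 recipe may differ (e.g. per-channel scales or mixed bit widths).

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (illustrative sketch
    only; the submission's mixed-quant scheme may differ)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    return [qi * scale for qi in q]

w = [0.5, -1.27, 0.0, 1.27]
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
assert q == [50, -127, 0, 127]
# Round-trip error is bounded by one quantization step.
assert all(abs(a - b) <= s for a, b in zip(w, w_hat))
```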
TTT (Test-Time Training) Details
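The defining property of legal "score-first" TTT (PR #77) is ordering: each test token is scored under the current model before any update that uses that token, so no token ever influences its own score. The sketch below illustrates that ordering with a toy adaptive unigram count model standing in for the neural model; the real procedure updates network weights instead.

```python
import math

def score_first_ttt(tokens, vocab=4):
    """Toy score-first test-time adaptation: score each token under the
    *current* model, then adapt on it. The adaptive model here is a
    Laplace-smoothed unigram counter (a stand-in, not PR #77's method)."""
    counts = [1.0] * vocab
    total_nll = 0.0
    for t in tokens:
        p = counts[t] / sum(counts)   # score first...
        total_nll += -math.log2(p)
        counts[t] += 1.0              # ...then update
    return total_nll / len(tokens)    # bits per token

bpt = score_first_ttt([0, 0, 0, 0, 1, 0, 0, 0])
# Adaptation pulls the average loss below the uniform 2 bits/token.
assert 0.0 < bpt < 2.0
```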
SLURM Job Provenance
SLURM scripts: slurm_pr940_base_20k_ttt.sh, slurm_pr940_flow_20k_ttt.sh
Eval jobs: eval_base20k_nottt, eval_base20k_legal_ttt, eval_flow20k_nottt, eval_flow20k_legal_ttt
Training script: train_gpt_pr940.py (2,601 lines); environment variables control all configuration.
Credits
Base architecture and gated attention/value residual (PR #940/#549, @abaybektursun), Muon optimizer (baseline), BigramHash/SmearGate (PR #65, @aquariouserworkman), XSA (PR #187/#265, @Idan3011/@unnir), mixed quant (PR #76), sliding window eval (PR #50, @mattqlf), legal score-first TTT (PR #77, @samacqua), VE/PartialRoPE/LN Scale (PR #315/#374, @jfprincz/@unnir), EMA (PR #65, @aquariouserworkman), LeakyReLU² (PR #549, @abaybektursun), GEPA 20k prior work (@mcclec07), FlowRefiner (PR #1170, @mcclec07), scaling study and this submission (@mcclec07).